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Abstract 

Mesh adaption is a powerful tool for efficient un- 
structured grid computations but causes load imbal- 
ance on multiprocessor systems . To address this prob- 
lem, we have developed PLUM, an automatic portable 
framework for performing adaptive large-scale numer- 
ical computations in a message-passing environment. 
This paper makes several important additions to our 
previous work. First, a new remapping cost model is 
presented and empirically validated on an SP2. Next, 
our load balancing strategy is applied to sequences of 
dynamically adapted unstructured grids. Results indi- 
cate that our framework is effective on many proces- 
sors for both steady and unsteady problems with sev- 
eral levels of adaption. Additionally, we demonstrate 
that a coarse starting mesh produces high quality load 
balancing, at a fraction of the cost required for a fine 
initial mesh . Finally, we show that the data remapping 
overhead can be significantly reduced by applying our 
heuristic processor reassignment algorithm. 

1 Introduction 

Dynamic mesh refinement/coarsening on unstruc- 
tured grids is a powerful tool for computing large-scale 
problems that require grid modifications to efficiently 
resolve solution features. Unfortunately, the adaptive 
solution of unsteady problems causes load imbalance 
among processors on a parallel machine because the 
computational intensity is both space and time depen- 
dent. Various dynamic load balancing methods have 
been reported to date; however, most of them lack a 
global view of loads across processors. 

Our goal is to build a portable system for effi- 
ciently performing adaptive large-scale numerical cal- 
culations in a parallel message- passing environment. 
Figure 1 depicts our framework, called PLUM, for such 
an automatic system. The mesh is first partitioned and 
mapped among the available processors. A solver then 
runs for several iterations, updating solution variables. 
Once an acceptable solution is obtained, a mesh adap- 
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tion procedure is invoked. It first targets edges for 
coarsening and refinement based on an error indicator 
computed from the numerical solution. The old mesh 
is then coarsened, resulting in a smaller grid. Since 
edges have already been marked for refinement, it is 
possible to exactly predict the new mesh before actu- 
ally performing the refinement step. Program control 
is thus passed to the load balancer at this time. A 
quick evaluation step determines if the new mesh will 
be so unbalanced as to warrant a repartitioning. If 
the current partitions will remain adequately load bal- 
anced, control is passed back to the subdivision phase 
of the mesh adaptor. Otherwise, a repartitioning pro- 
cedure is used to divide the new mesh into subgrids. 
The new partitions are then reassigned to the proces- 
sors in a way that minimizes the cost of data move- 
ment. If the remapping cost is less than the compu- 
tational gain that would be achieved with balanced 
partitions, all necessary data is appropriately redis- 
tributed. Otherwise, the new partitioning is discarded. 
The computational mesh is then actually refined and 
the numerical calculation is restarted. 

Extensive details of the parallel mesh adaption 
scheme, called 3D_TAG, that is used in this work is 
given in [6]. The parallel version consists of C++ and 
MPI code wrapped around the original serial mesh 
adaption program [3]. An object-oriented approach 
allowed the distributed-memory implementation to be 
performed in a clean and efficient manner. Notice from 
the framework in Fig. 1 that splitting the mesh refine- 
ment step into two distinct phases of edge marking and 
mesh subdivision allows the subdivision phase to oper- 
ate in a more load balanced fashion. In addition, since 
data remapping is performed before the mesh grows 
in size due to refinement, a smaller volume of data is 
moved. This, in turn, leads to significant savings in 
the redistribution cost. 

2 Dynamic Load Balancing 

PLUM is a novel method to dynamically balance 




Figure 1: Overview of PLUM, our framework for parallel adaptive numerical computation. 


the processor workloads with a global view. Results re- 
ported earlier either focused on fundamental load bal- 
ancing issues [7] or various refinement strategies [2,5] 
to demonstrate the viability and effectiveness of our 
framework. This paper presents, for the first time, 
the application of PLUM to sequences of dynamically 
adapted unstructured grids. We also present a data 
remapping cost model that can accurately predict the 
total cost of data redistribution on an SP2 given the 
number of tetrahedral elements that have to be moved 
among the processors. 

Our load balancing procedure has five novel fea- 
tures: (i) a dual graph representation of the initial 
computational mesh keeps the complexity and connec- 
tivity constant during the course of an adaptive com- 
putation; (ii) a parallel mesh repartitioning algorithm 
avoids a potential serial bottleneck; (iii) a heuristic 
remapping algorithm quickly assigns partitions to pro- 
cessors so that the redistribution cost is minimized; 
(iv) an efficient data movement scheme allows remap- 
ping and mesh subdivision at a significantly lower cost 
than previously reported; and (v) accurate metrics es- 
timate and compare the computational gain and the 
redistribution cost of having a balanced workload after 
each mesh adaption step. 

Using the dual of the initial computational mesh 
for the purpose of dynamic load balancing is one of 
the key features of this work. Each dual graph ver- 
tex has two weights associated with it. The computa- 
tional weight, Wcomp, models the workload for the cor- 
responding element. The remapping weight, w rem ap, 
models the cost of moving the element from one pro- 
cessor to another. The weight w comp is set to the num- 
ber of leaf elements in the refinement tree because only 
those elements that have no children participate in the 
numerical computation. The weight Wremap, however, 


is set to the total number of elements in the refine- 
ment tree because all descendants of the root element 
must move with it from one partition to another, if 
so required. Every edge of the dual graph also has 
a weight, u> C omm) that models the runtime interpro- 
cessor communication. The value of w QO mm is set to 
the number of faces in the computational mesh that 
corresponds to the dual graph edge. The mesh con- 
nectivity, u>comp> and tfcomm together determine how 
dual graph vertices should be grouped to form par- 
titions that minimize both the disparity in the par- 
tition weights and the runtime communication. The 
Wremap determines how partitions should be assigned 
to processors such that the cost of data redistribution 
is minimized. New computational grids obtained by 
adaption are translated to u> comp and u; rema p for every 
vertex and to w comm for every edge in the dual mesh. 

If a preliminary evaluation step determines that 
the dual graph with a new set of u; comp is unbalanced, 
the mesh needs to be repartitioned. A good partitioner 
should minimize the total execution time by balancing 
the computational loads and reducing the interproces- 
sor communication time. In addition, the repartition- 
ing phase must be performed very rapidly for our load 
balancing framework to be viable. For the test cases 
in this paper, an alpha version of ParMeTiS [4], a 
parallel multilevel algorithm, was used as the repar- 
titioner. Results indicate that this partitioner can be 
effectively used inside PLUM; however, any other al- 
gorithm can also be used as long as it quickly delivers 
partitions that are reasonably balanced and require 
minimal overhead. 

Once new partitions are obtained, they must be 
mapped to processors such that the redistribution cost 
is minimized. In general, the number of new partitions 
is an integer multiple F of the number of processors. 














Each processor is then assigned F unique partitions. 
The first step toward processor reassignment is to com- 
pute a similarity measure 5 that indicates how the 
remapping weights w cemap of the new partitions are 
distributed over the processors. It is represented as 
a matrix where entry Sij is the sum of the u; rema p 
of all the dual graph vertices in new partition j that 
already reside on processor i . A similarity matrix for 
P = 4 and F = 2 is shown in Fig. 2. Only the non-zero 
entries are shown. 


New Partitions 



Figure 2: A similarity matrix after processor reassignment 
using the heuristic algorithm and the TotalV metric. 

The goal of the processor reassignment phase is 
to find a mapping between partitions and processors 
such that the data redistribution cost is minimized. 
Various cost functions are usually needed to solve this 
problem for different architectures. In [5], we investi- 
gated two general metrics: TotalV, that minimizes the 
total volume of data moved among all processors, and 
MaxV, that minimizes the maximum flow of data to or 
from any single processor. TotalV assumes that by 
reducing network contention and the total number of 
elements moved, the remapping time will be reduced. 
Both an optimal and a heuristic greedy algorithm have 
been implemented for solving the processor reassign- 
ment problem using TotalV [5]. Applying the heuristic 
procedure to the similarity matrix in Fig. 2 generates 
the processor assignment shown in the bottom row. 
It was proved in [5] that a processor assignment ob- 
tained using the heuristic algorithm can never result 
in a data movement cost that is twice that of the op- 
timal assignment. MaxV, on the other hand, considers 
data redistribution in terms of solving a load imbal- 
ance problem, where it is more important to minimize 
the workload of the most heavily-weighted processor 
than to minimize the sum of all the loads. An optimal 
algorithm for solving the assignment problem using 
MaxV has also been implemented [5]. 

3 Remapping Cost Model 

Once the reassignment problem is solved, a model 
is needed to quickly predict the expected redistribu- 
tion cost for a given architecture. Accurately estimat- 


ing this time is very difficult due to the large number 
and complexity of the costs involved in the remapping 
procedure. The computational overhead includes re- 
building internal data structures and updating shared 
boundary information. Predicting the latter cost is 
particularly challenging since it is a function of the 
old and new partition boundaries. The communication 
overhead is architecture-dependent and can be difficult 
to predict especially for the many-to-many collective 
communication pattern used by the remapper. 

Our redistribution algorithm consists of three ma- 
jor steps: first, the data objects moving out of a par- 
tition are stripped out and placed in a buffer; next, 
a collective communication appropriately distributes 
the data to its destination; and finally, the received 
data is integrated into each partition and the bound- 
ary information is consistently updated. Performing 
the remapping in this bulk fashion, as opposed to send- 
ing individual small messages, has several advantages 
including the amortization of message start up costs 
and good cache performance. Additionally, the total 
time can be modeled by examining each of the three 
steps individually since the two computational phases 
are separated by the implicit barrier synchronization of 
the collective communication. The computation time 
can therefore be approximated as: 

a x max(ElemsSent) + f3 x max(ElemsRecd) -f 6, 
where a and ft represent the time necessary to strip 
out and insert an element respectively, and 6 is the 
additional cost of processing boundary information. 
The maximum values of ElemsSent and ElemsRecd 
can be quickly derived from the solved similarity ma- 
trix. Since the value of 6 is difficult to predict exactly 
and constitutes a relatively small part of the computa- 
tion, we assume that it is a small constant. To simplify 
our model even further, we assume that a = ft. 

Much work has been done to model communica- 
tion overhead including LogGP [1] and BSP [9]. Both 
models make the following assumptions which hold 
true for most architectures including the SP2: a receiv- 
ing processor may access a message or parts of it only 
after the entire message has arrived; and, at any given 
time a processor can either be sending or receiving a 
single message. Our redistribution procedure closely 
follows the superstep model of BSP. An advantage of 
the SP2 interconnection mechanism is that all nodes 
can be considered equidistant from one another. This 
allows us to predict communication overhead without 
the need to model multiple hops for individual mes- 
sages. We approximate our communication cost as: 

g x max(ElemsSent) + 9 x max(ElemsRecd) + /, 
where g is a machine-specific cost of moving a single 
element and / is the time for barrier synchronization. 


The total expected time for the redistribution pro- 
cedure can therefore be expressed as: 

7 x MaxSR + 0, 

where MaxSR = max(ElemsSent) + max(ElemsRecd), 
7 = a + g, and 0 = 6 + 1. In order to compute the 
slope and intercept of this linear function, several data 
points need to be generated for various redistribution 
patterns and their corresponding run times. A simple 
least squares fit can then be used to approximate 7 
and 0. This procedure needs to be performed only 
once for each architecture, and the values of 7 and 0 
can then be used in actual computations to estimate 
the redistribution cost. Note that there is a close rela- 
tionship between MaxSR of the remapping cost model 
and the theoretical metric MaxV. The optimal similar- 
ity matrix solution for MaxSR is provably no more than 
twice that of MaxV. 

The computational gain due to repartitioning is 
proportional to the decrease in the load imbalance 
achieved by running the adapted mesh on the new 
partitions rather than on the old partitions. It can be 
expressed as Titer N &dapt (W^ x - W™), where T iter 
is the time required to run one solver iteration on 
one element of the original mesh, A^apt is the num- 
ber of solver iterations between mesh adaptions, and 
^max anc * ^max are SUm °f w comp On the most 
heavily-loaded processor for the old and new partition- 
ings, respectively. The new partitioning and processor 
reassignment are accepted if the computational gain is 
larger than the redistribution cost. In that case, all 
data is appropriately redistributed. 

4 Results 

The 3D-TAG parallel mesh adaption procedure 
and the PLUM global load balancing strategy have 
been implemented in C, C-f-h, and MPI on an SP2. 
The computational mesh for the test cases in this pa- 
per is one used to simulate an acoustics experiment 
where a l/7th-scale model of a UH-1H helicopter rotor 
blade was tested over a range of subsonic and transonic 
hover-tip Mach numbers. Detailed numerical results of 
the simulation are given in [8]. A cut-out view of the 
initial tetrahedral mesh is shown in Fig. 3. 

In the first set of experiments, a total of three 
adaptions are performed in sequence on this initial 
mesh. Table I lists the size of the computational mesh 
after each level of adaption. Notice that the final mesh 
is more than an order of magnitude larger than the 
initial mesh. This is a steady-state calculation where 
mesh adaption is used to resolve the leading edge com- 
pression and capture both the surface shock and the 
acoustic wave that propagates to the far field. 

Figure 4 shows how the execution time is spent 
during the adaption and the subsequent load balanc- 



Figure 3: Cut-out view of the initial tetrahedral mesh. 



Vertices 

Elements 

Edges 

Initial 

13,967 

60,968 

78,343 

Level 1 

35,219 

179,355 

220,077 

Level 2 

72,123 

389,947 

469,607 

Level 3 

137,474 

765,855 

913,412 


Table I: Progression of grid size through a sequence of 
three levels of adaption for a steady-state computation 

ing phases for the first and third levels. Our heuristic 
greedy algorithm is used to perform the processor re- 
assignment. The reassignment times are not shown 
since they are several orders of magnitude smaller 
than the other times. The repartitioning curves, using 
ParMeTiS [4], are almost identical for the three lev* 
els because the time to repartition mostly depends on 
the initial problem size. The repartitioning times an 
also almost independent of the number of processors 
The mesh adaption times increase with the size of tin* 
mesh; however, they consistently show an efficiency of 
about 85% on 64 processors for all three levels. In 
fact, the efficiency increases with the mesh size be- 
cause of a larger computation-to-communication ratio 
The remapping times gradually decrease as the num- 
ber of processors is increased. This is because even 
though the total volume of data movement increases 
with the number of processors, there are actually more 
processors to share the work. The remapping time in- 
creases from one adaption level to the next because of 
the growth in the mesh size. However, as shown later 
in this paper, the remapping times stabilize when the 
mesh size remains approximately constant. More im- 
portantly, the remapping overhead always dominates 
and is generally about four times the adaption cost on 
64 processors. This is not unexpected since remapping 
is considered the bottleneck in dynamic load balancing. 
It is for this reason that the remapping cost needs to 




Figure 4: Execution times for the three levels of adaption. 


be predicted accurately before a load balancing phase 
to be certain that the data redistribution cost will be 
more than compensated by the computational gain. 

The second set of experiments is performed 
to compute the slope 7 and the intercept O of our 
redistribution cost model. Experimental data is gath- 
ered by running various redistribution patterns. The 
remapping times are then plotted against two metrics, 
TotalV and MaxSR, in Fig. 5. Results demonstrate 
that on an SP2, there is little obvious correlation be- 
tween the total number of elements moved (TotalV) 
and the expected run time for the remapping proce- 
dure. On the other hand, there is a clear linear correla- 
tion between the maximum number of elements moved 
(MaxSR) and the actual redistribution time. There are 
some perturbations in the plots resulting from factors 
such as network hotspots and shared data irregulari- 
ties, but the overall results show that our redistribu- 
tion model successfully estimates the data remapping 
time. This important result indicates that reducing 
the bottleneck overhead, rather than the aggregate, 
guarantees a reduction in the redistribution time. 



Figure 5: Remapping time as a function of TotalV and 
MaxSR metrics. 


The third set of experiments is performed to 
evaluate the efficacy of PLUM in an unsteady envi- 
ronment where the adapted region is strongly time- 
dependent. To achieve this goal, a simulated shock 
wave is propagated through the initial mesh shown in 
Fig. 3. The test case is generated by refining all ele- 
ments within a cylindrical volume moving left to right 
across the domain with constant velocity, while coars- 
ening previously-refined elements in its wake. The 


performance of PLUM is measured at nine successive 
adaption levels. Note that because these results are 
derived directly from the dual graph, mesh adaption 
times are not reported, and remapping overheads are 
computed using our redistribution cost model. 

Figure 6 shows the progression of grid sizes for 
the nine levels of adaption in the unsteady simulation. 
Both coarse and fine meshes are used in the experi- 
ment to investigate the relationship between load bal- 
ancing performance and dual graph size. The initial 
fine mesh is eight times the size of the coarse mesh 
shown in Fig. 3. Note that although an axisymmetric 
cylinder moves through the meshes at constant veloc- 
ity, the sizes of the meshes change erratically due to 
the nonuniformity of the initial meshes. 



Figure 6: Progression of grid sizes through nine levels of 
adaption for an unsteady computation. 

Figure 7 presents the partitioning and remapping 
times for both mesh granularities. Two remapping 
strategies are used, resulting in different remapping 
times at each level. One strategy uses the default pro- 
cessor mapping given by ParMeTiS [4], while the other 
performs processor reassignment based on our heuris- 
tic solution of the similarity matrix. Several obser- 
vations can be made from the resulting graphs. First, 
our heuristic remapper always outperforms the default 
strategy, typically resulting in over a two-fold speedup 
of the data remapping phase. This shows that pro- 
cessor reassignment must be performed using a proper 
metric to minimize the remapping time. Second, when 
comparing dual graph granularities, results show that 
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Figure 7: Partitioning and remapping times. 






the finer mesh increases both the partitioning and the 
remapping times by almost an order of magnitude. 
This is expected since the larger graph is harder to par- 
tition and requires more data movement during remap- 
ping. Finally, increasing the number of processors does 
not have a major effect on the partitioning overhead, 
but causes a noticeable reduction in the remapping 
times. This indicates that our load balancing strategy 
will remain viable on a large number of processors. 

Figure 8 presents the quality of load balancing for 
both meshes. Load balancing quality is defined in two 
ways: the computational load imbalance factor and the 
percentage of cut edges. The load imbalance factor is 
the ratio of the sum of the u; comp on the most heavily- 
loaded processor to the average load across all proces- 
sors. For all the cases, the partitioner does an excellent 
job of reducing the imbalance factor to unity. Using 
a finer mesh has a negligible effect on the imbalance 
factor after load balancing, but requires a substan- 
tially longer repartitioning time. The percentage of cut 
edges always increases with the number of processors. 
This is expected since the surface-to-volume ratio in- 
creases with the number of partitions. Notice that the 
percentage of cut edges generally grows with each level 
of adaption. This is because successive adaptions cre- 
ate a complex distribution of computationally-heavy 
nodes in the dual graph, thereby requiring partitions 
to have more complicated boundaries to achieve load 
balance. This increases the surface-to-volume ratio 
of the partitions, resulting in a higher percentage of 
cut edges. The finer mesh consistently has a smaller 
percentage of cut edges because the partitioner has a 
wider choice of edges to find a better cut. However, 
we believe that this savings in the number of cut edges 
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Figure 8: Quality of load balancing. 


does not warrant the significantly higher overhead of 
the finer mesh. 

5 Conclusions 

We have shown in this paper that our load balanc- 
ing scheme, called PLUM, works well for both steady 
and unsteady adaptive problems with many levels of 
adaption, even when using a coarse initial mesh. A 
finer starting mesh may be used to achieve lower edge 
cuts and marginally better load balance, but is gen- 
erally not worth the increased partitioning and data 
remapping times. Results have also demonstrated that 
our framework scales with the number of processors 
and that our heuristic processor reassignment algo- 
rithm significantly reduces data remapping times. Fi- 
nally, a new remapping cost model was presented and 
quantitatively validated. Results indicated that reduc- 
ing the bottleneck overhead guarantees a reduction in 
the total redistribution time. 
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